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ABSTRACT 

Nested dissection orderings are known to be very effective for solving the 
sparse positive definite linear systems which arise from n by n grid problems. 
In this paper nested dissection is shown to be the final step of incomplete nested 
dissection , an ordering which corresponds to the premature termination of dissec- 
tion. Analyses of the arithmetic and storage requirements for incomplete nested 
dissection are given and the ordering is shown to be competitive with nested dis- 
section under certain conditions . 
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A VARIANT OF NESTED DISSECTION 
FOR SOLVING n BY n GRID PROBLEMS 


1. INTRODUCTION 

Recently there has been considerable effort ([1], [2]) to demonstrate the 

efficiency of an ordering strategy known as nested dissection for solving the sparse 

symmetric positive definite systems of linear equations associated with an n by 

2 

n grid consisting of (n-1) small squares, called elements . Any numbering of 

2 2 2 

the grid points from 1 to n yields the n by n matrix equation 


( 1 . 1 ) 


Ax = b, 


where A^^ = 0 unless grid points i and j belong to the same element. 

Although the analysis in this paper is for a square region, that is not as 
restrictive as it might seem. For example, a general region may be "subdivided" or 
"substructured" into a union of smaller regions, many of which will be squares if 
the subdivision is chosen properly, 

T 

The system (1.1) is solved by first factoring A into LL where L is 

T 

lower triangular and then by solving the systems Ly = b and L x = y. V/hen the 
factorization of A is carried out, it usually suffers from "fill"; that is, some 
locations of A that were zero become nonzero in the triangular factors. Thus we 


are led to consider the equivalent problem, 




(PAP )Px = Pb, 






where P is a permutation matrix chosen so as to reduce arithmetic requirements , 
to reduce the fill, or to achieve other objectives. 

It is well-known ([1], [2]) that the nested dissection ordering reduces the 

4 

factorization stage operation count from 0(n ) for the row by row or band-oriented 

3 

ordering to 0(n )j similarly the fl ‘ ' occurring during the factorization is 
3 2 

reduced from OCn ) to 0(n log^n) . Furthermore, nested dissection has been 
shown to be optimal in the asymptotic sense ([1], [4]). 

In essence, the nested dissection algorithm repeatedly applies a basic step 
to square sufameshes of the original mesh. The basic step consists of choosing a 
separating "+" (see Figure 1.1) which, as nearly as possible, equally divides the 
subraesh into quadrants. The nodes in the quadrants are numbered before those on the 


Figure 1.1. A basic step in nested dissection, indicating 
the separating cross. 


cross. This basic step is applied successively until it is no longer possible to 
subdivide the mesh. 

The purpose of this paper is to study the consequences of terminating the mesh 



subdivision before completion. That is, at some stage, do not subdivide the quad- 
rants further, but simply use a row by row, band-oriented ordering for the nodes 
in each quadrant. This idea is not completely newj the related idea of substruc- 
turing has been used for many years in structural engineering applications [5]. 
However, a careful analysis has never appeared in the literature. 

There are at least three reasons for studying this ordering: the arithmetic 

and storage requirements have never been analysed; the overhead for handling the 
sparse matrix components is simpler than for nested dissection; and vector compu- 
ters like the Control Data Corporation STAR- 100 and the Texas Instruments, Inc. 

ASC can be better utilized because the lengths of vector ouerands can be greater 
than for nested dissection [3]. 

In section 2 a detailed description of the variant of nested dissection is 
given. The operation counts developed are for multiplicative operations. Section 
3 contains information regarding the arithmetic and storage, requirements of the 
new algorithm. It is shown that this variant is quite competitive with nested 


dissection. 


1 


2. INCOMPLETE NESTED DISSECTION ORDERINGS 


Following [ 2 ] , let V te the set of nodes of the n by n mesh and let 

consist of the set of nodes on a vertical mesh line which as nearly as possible 

1 12 

divides the mesh into two equal parts R^^ and R^, where R^ u R^^ = R^^ = V\C^. 

1 2 

Numbering nodes in R^, followed by those in R^, followed finally by those in 
C^, induces the following block structure in the reordered matrix A. 


( 2 . 1 ) 


A - 


^11 


^13 


22 


23 


13 


23 


33 


Now choose vertex sets c R^, = 1,2, consisting of nodes lying on a horizontal 

mesh line which as nearly as possible divides R^^ into two equal parts. If we 

H Z 

number the variables associated with the vertices in those asso- 

Z 

ciated with S^, £ = 1,2, and then the remaining nodes as before, we induce the 
7 by 7 partitioning in A shown in (2,2). 


i' 

U 

a 
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The node sets of the mesh corresponding to the i-th diagonal block are 
depicted schematically in Figure 2.1. So far we have not specified how to number the 
nodes indicated by 1-4 in Figure 2.1. 



Figure 2.1 A one-level dissection of the mesh. 



We could 'cepeat the dissection process just described, numbering the nodes in 
mesh regions 1-4 in the same manner (but on a smaller scale) as for the original 
n by n mesh. This would involve choosing three dissectors for each of the 
four regions, and would create sixteen mesh regions of size about n/16 by n/16 
whose numbering must still be specified. If we repeat this procedure until the 
choice of dissectors finally leaves no nodes left to number, we obtain a nested 

3 

dissection ordering [1]. These orderings yield 0(n ) arithmetic counts and 

2 i 

0(n log^n) fill for the factorization. Of course when n 7^ - 1, j a posi- 

tive integer, this dissection process will not be completely uniform, and at 
the i-th dissection stage the node sub-arrays whose numbering is yet to be pre- 
scribed may not all be exactly the same size. We will refer to these sub-arrays 
as J?- level node arrays . 

Suppose we stop the dissection process sooner than necessary, say at the jl-th 

2 Z9j 

level, and simply number the n /2 A-level node arrays grid row by grid row (or 

grid column by grid column) , We will refer to such an ordering of the n by n 
grid as an incomplete nested dissection ordering . For example, if the nodes spe- 
cified by regions 1-4 in Figure 2.1 were numbered column by column, this would be a 
one-level incomplete nested dissection ordering. The standard band ordering is a 
zero-level dissection ordering. 

For a given i-level incomplete nested dissection ordering, the dissectors 
together with the £-level mesh subarrays induce a partitioning of the nodes into y 


5.-1 


y = 2- 




2-^ = 2^ ^ + (2- ^ - 1) . 


k=0 


subsets, where 


! 



21 

The matrix A has 2 leading diagonal submatrlces whose sizes are .-tbouc 

£ 2 £ 

(n/2 ) and whose bandwldths are about n/2 ; these correspond to the last-level 

mesh sub-arrays. The remaining diagonal submatrices corretpond to the dissectors. 
Although they are Initially sparse. It can be shown that by the time they are fac- 
tored, they have become full [2]. 

In order to show how we arrived at our timing formulas, consider Figure 2.2 
which displays the approximate structure of the one-level dissection ordering and 
partitioning specified by (2.2). The manner in which the last-level mesh subarrays 
were numoered is indicated by the arrows in Figure 2.1. 



1 2 3 4 5 6 7 


Figure 2.2 The approximate structure of A is shoi/n in the 
upper triangle, and the structure and the storage format of 
the corresoonding factor L are indicated by the hashing in 
the lower triangle. 
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i 



Observe that the "block columns" corresponding to the g diagonal blocks of 
A have sets of adjacent non-null rows creating non-null rectangular subarrays. Fur- 
thermore, these non-null subarrays are of two general types: those lying in block 

columns corresponding to dissectors are full , but those lying in block columns 
corresponding to last-level node arrays have some zeros and vary in structure as 
seen in Figure 2.2. 

Now consider the depiction of a three level dissection ordering contained in 

Figure 2.3. Notice that there are three different kinds of last-level node subarrays, 

which for obvious reasons we will refer to as comer , side , and interior node arrays. 

These correspond to leading diagonal blocks of A, all with bandwidth about n/2 

and size about (n/2 ) . However, the number of non-null submatrices in their 

respective block columns of L differ. Similarly, the "+" shaped dissector triples 

with labels r, 1 S r < £ are of three corresponding types, with corresponding 

similarities and differences. The character of the block columns corresponding to 

the last level blocks is displayed in Figure 2,4. The row dimension of the sub- 

% 

arrays in each such block column is about n/2 . Note that many of the components 
of the off-diagonal subarrays in Figure 2.4 are zero. \'?hen these zeros are in 
leading positions in their rows, the number of operations required for the factor- 
ization is significantly reduced. For example, if £ = 1 or 2, the total time for 
the factorization can be reduced by as much as 50%. The locations of the zeros are 
determined by the manner in which the nodes of the last level subarrays are ordered. 
This observation was made by Noorj et al. in [5]. Assuming that a band 
solver is to be used for factoring the diagonal blocks, the optimum row by row 
ordering is attained by numbering toward the dissectors. For example, for a 
corner array, this leads to the numbering indicated by the arrows in Figure 2.1 
and the off-diagonal blocks shown in Figure 2.4. The block columns 


corresponding to comer, side, and interior dissectors are similar but the matrix 
subarrays in the block columns are full. 
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Figure 2.3 Depiction of a three level dissection ordering, 
where nodes with number k are numbered before nodes numbered 
k-1, and vertical node sets are numbered after horizontal node 
sets with Che same label. 
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block column corresponding to a last-level comer subarray of nodes 



Figure 2.4 Diagram showing the character of the block columns 
of L corresponding to last-level comer, side, and interior node 
sub arrays . 


- 10 - 


Our operation counts are obtained under the assumption that the banded charac- 
2Z 

ter of the leading 2 diagonal blocks is exploited and the leading zeros of the 
off-diagonal blocks are exploited. Our basic strategy in obtaining operation counts 
is to determine the number of multiplications and divisions it takes to execute the 
factorization steps corresponding to each individual diagonal block. In order to do 
this for the leading diagonal blocks we need to establish the structure as well as 
the size of the submatrices in the corresponding block columns. However, since the 
submatrices in the block columns corresponding to dissectors are full, we need only 
be concerned with the total number of non-null rows in these block columns. Let S 
be a dissector which was chosen to subdivide the node subarray R. Then the number 
of non-null rows in the block column of L corresponding to S is | 3S | , where 
3S is the set of nodes which are not in R but share at least one grid square with 
some node in R (a full explanation of this recipe can be found in [2]). Thus, the 
computation required to perform the |sj factorization steps corresponding to the 
set S is equivalent to carrying out the first jsj steps of the factorization of 
the Js[ + [3S| by |s| + |9s| block two by two matrix shown below, which becomes 
what is shown at the right if we compute the partial factorization "in place". 
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I 

T -1 T 

Here ^3S " ^9S ~ denote the number of operations 

required to carry out this computation by T_(ls|, j3s|,k) where k denotes the 

r 

number of dissectors connected to S. Similarly, the number of operations for the 
lower solve attributable to the block column of L corresponding to S is denoted 
by T^(|s|, |9s|,k), and for the upper solve, Ty(|s|, |9|,k). Details can be 
found in Appendix C of [3]. 

Now consider the computation involved in carrying our the factorization steps 

2 

corresponding to a last-level interior node array having p factorization steps 

2 2 2 2 
of the p + 4(p+l) by p + 4(p+l) matrix below, where is p by p with 

bandwidth p+1, and the other diagonal blocks are either p by p or p+2 by p+2. 

For convenience, we assume they are all p+1 by p+1, since this will not change 

our estimates by very much. 



The partial factorization, with the parts of L left "in place" in the array 
becomes 
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(2.4) 



where 



(2.5) 


"u = 


2 < i < 5, 


and 


(2.6) a! . = A. . - W^.W, 2 < i < 5, i < j < 5. 

ij ij li Ij 

Now in carrying out the computation (2.5), we exploit the fact that A^^^ has lead- 
ing zeros as indicated in Figure 2.4, and we also exploit the structure in the 
W^^'s when performing the matrix multiplication in (2.6). In addition, we of 
course exploit the banded character of A^^^ when we factor it and in (2 .5) , and we 
also exploit symmetry wherever possible. We denote the number of operations required 

to carry out the above computation by T (p), where the superscript implies that the 

r 

computation is associated with a last level _interior p by p node subarray. In 

C S 

a similar way, we can derive functions T (p) and T (p) for last level corner 

F F 

and side node subarrays. The subscript F denotes factorization; similar functions 
with subscripts L and U can be defined corresponding to the work attributable 
to the block column during the fower and upper solve. 

Having made these observations, using Figure 2.3 as a guide, it is now possible 
to derive our operation counts. They arc only entioaten because we assume 
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n = 2^ - 1 so that the dissection is uniform. However, some independent work has 
shown that estimates obtained using this strategy are extremely good for general n 
(see [2], Table 3.2). The details of these formulas may be found in Appendix C of [3]. 

The formula T (q,p,k) represents the n’traber of operations required to eliminate 
r 

the nodes, corresponding to a particular dissector. The first variable, q, repre- 
sents the number of nodes in the dissector; p denotes the number of other nodes 
which are connected to the dissector; and k is the -number of other dissectors 
connected to the dissector (see C.6.13 of Appendix C in [3]). Obviously, the number 
of operations required to execute the last 2n - 1 steps of the factorization is 


(2.7) 


Tj.(n,0,0) + 2 Tj,[ 


where the first term is due to the vertical line of nodes labelled 1, and '"he 
second is due to the two horizontal lines of nodes labelled 1. The number of opera- 
tions required for the factorization steps corresponding to the nodes labelled 2 
is given approximately by 



In general, the number of operations required to eliminate nodes labelled 
2 £ r £ 2 is given approximately by 


(2.9) 






r 41 2^-^ - 1 




^n-2^-fl 5(n+l) , . 

r * r ^ 

V 2 


Fl ^r-1 


2^ 2^ j 2^ 2^ 


J 


- 1 


■•)] 
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4 


Here the first, second and third terms are due to interior, side, and comer dissec- 
tor triples respectively. 

The number of operations required for the factorization steps corresponding to 

S, 2. 

the last-level blocks is given approximately by the following, where ^ ■ (n-2 +l)/2 . 


( 2 . 10 ) 


(2^ - 2)^p(0 4(2^ - 2)Tp(C) + 4Tp(0 


Using the symbolic computation system MACSYMA* [6] to sum (2.9) from r ■ 2 to 
I and adding in (2.7) and (2.10) yields the estimate (see Appendix B of [3] for 
details) : 

if t - 0, 


(2.11) 


Mp(n,0) Ci ^ n^ + n^ + 2n^ - n 


if i > 0, 

(2.12) Mj,(n,0 ~ nY^ x - 10 x 2 '^^ + 6 * 2‘^y 


^ „3/||9 . ^ ^ X 2-2^ - ^ X 2-3^ . 24 x 2-^) 


^ am . i|3 , 2-^ ^ 86 X 2 


■21 664 ^-32 ^ ,, »-4i^ 

— X 2 + 36 * 2 J 


1657 . 325 


+ a(- . 2 * + 6 . i - ^ 


123 ,-22 524 ,-32 


X 2 


X 1 


f 24 X 2"^‘) 


L ,22 17 


,2 . , 2459 935 ,-2 ^ 4 ,-22 

X 2 + 23 * 2 — + X 2 — * 2 

41 12 3 


454 ,-32 ^ , ,- 32 ^ 

- jj- X 2 -*-6x2 i 


*MACSYMA is supported by the Defense Advanced Research Projects Agency work 
order 2095, under Office of Naval Research Contract 4N00014-75-C-0661. 
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Similarly, an estimate for the lower solve, Mj^(n,A), is obtained by summing exactly 
the same terras with replaced by We obtain, for i = 0, 

(2.13) M^(n,5.) ci n^+|n^-|n 
For Jl > 0, 

(2.14) Mj^(n,£) ci n^(3 x 2”^ - 4 x + 2 x 2"^^) 

+ n^ X i - -p + 31 X 2"*' - 18 X 2“^^ + 6x2" 

+ n ~ X 2^ + X A - 13 + 39 X 2"^ - 24 X 2~^^ + 6 x 2~^^) 

+ ( 2 ^^ + I X 2^ - X ii - ii + 11 X 2 "^ - 10 X 2"^^ + 2 X 2 "^^) 



The number of multiplications for the upper solve, as well as the storage 

requirements (see section 3) are also given by (2.14). 


3. OPERATION AND STORAGE REQUIREMENTS ’ 

It is not obvious which are the dominant terms of (2.12). For example, 

(3.1) Mp(n,l) 0.292n^ + 2.54n^ + 0(n“) 

4 

and the n term is dominant for n > 8. On the other hand, 

(3.2) Mp(n,3) 0.055n^ + 6.87n^ + O(n^) 

4 3 

and n must be greater than 125 before the n term is larger than the n term. 
Finally, for the largest value of 1, log 2 (n+l) - 1,* we get 

* j 

In this section we assume that n = 2 - 1, for j an integer, in order to 

simplify the analysis. The general observations hold for an arbitrary value of n. 


I 
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(3.3) ^ 9.87n^ - ITn^log^ (n+1) + 22. In^ + O^n log 2 (n-H)) 


The estimate (3.3) for the multiplication count of nested dissection differs only 
slightly in the third term from a similar formula given in (2). The difference can 
be attributed to a slight improvement in the estimate of the multiplication count. 

Table 3.1 and Figure 3.1 display the multiplication counts for the factori- 
zation stage of the Incomplete nested dissection ordering. The number of operations 
decreases quite rapidly for small I as the level of dissection increases. How- 
ever, as I approaches its largest possible value, very little is gained. For all 
of the cases listed in Table 3.1, the final step of dissection decreases the multi- 
plication count by less than 5%. On the ocher hand, the last step increases Che 
complexity of data management, as will be described later. 



15 

31 

63 

127 

255 

banded 

31.1 (3) 

512 (3) 

8.30 (6) 

133.0 (6) 

2142 (6) 

1 

23.9 (3) 

347 (3) 

5.24 (6) 

81.1 (6) 

1230 (6) 

2 

23.9 (3) 

290 (3) 

3.71 (6) 

51.3 (6) 

753 (6) 

3 

23.0 (3) 

243 (3) 

2.53 (6) 

28.1 (6) 

345 (6) 

4 


235 (3) 

2.20 (6) 

20.5 (6) 

202 (6) 

5 



2.16 (6) 

18.9 (6) 

165 (6) 

6 




18.7 (6) 

157 (6) 

7 





156 (6) 


Table 3.1 Multiplication count for factorization stage of 
incomplete nested dissection and banded orderings for several 
values of n and i ■ 1 (1) j_log,(n+l) - ij. 


Now consider multiplication counts for the lower solve, as displayed in Table 
3.2 and Figure 3.2 (the counts for the upper solve are the same). Since each com- 
ponent of the matrix L is handled exactly once in the lower solve, it is obvious 
that the storage requirements for the components of the matrix are identical to the 
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mulclplicacion count cf the lower solve. The remaining comments in this section 
will be concerned with storage although analogous comments can be made about the 
lower and upper solves. 



15 

31 

63 

127 

255 

banded 

3.58 (3) 

30.7 (3) 

254 (3) 

2.064 (6) 

16.65 (6) 

1 

2.92 (3) 

24.1 (3) 

195 (3) 

1.568 (6) 

12.56 (6) 

2 

2.63 (3) 

19.9 (3) 

151 (3) 

1.162 (6) 

9.11 (6) 

3 

2.54 (3) 

17.1 (3) 

113 (3) 

.793 (6) 

5.84 (6) 

4 


16.5 (3) 

97 (3) 

.586 (6) 

3.83 (6) 

5 



95 (3) 

.509 (6) 

2.87 (6) 

6 




.499 (6) 

2.53 (6) 

7 





2.49 (6) 


Table 3.2 Multiplication counts for lower solve (sane for 
upper solve and storage) stage of incomplete nested dissection 
and banded orderings for several values of n and 
1 - 1 (1) [log^Cn+D - Ij. 


As was the case for the factorization stage, very little is gained by the 
final step of dissection. The storage requirement drops less than 5% from the 
penultimate step to the last step for all values of n shown in Table 3.2. More- 
over, the following analysis of the storage requirements of an implementation of 
incomplete nested dissection shows no decrease at all. 

Consider an implementation scheme for incomplete nested dissection analogous 

to that described by George [2] for nested dissection. George’s scheme requires 
2 

about 14b + 2n additional memory locations for information describing the blocks 
of L, where b is the number of diagonal blocks. This "data management" infor- 
mation includes dimensions of blocks, location of the first element in eacn row. 
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and this additional 


2 

pointers, etc. For nested dissection b is approximately ^ 
storage is 

9n^ + 0(n) . 

Now consider incomplete dissection where the dissection is stopped 1 short 
of completion — I = log^Cn+l) - 2. As mentioned in section 2, the number of dia- 
gonal blocks is 

2 

(3.4) ■ M = 2xZ^'^- 1=2- + 0(n) . 

o 

It follows that George's scheme requires only 

IS 2 

+ 0(n) 

21 2 

locations. By stopping short 1 step, one may save as much as — n locations 
for data management storage. Taking this into account in Table 3.2, one sees that, 
assuming George’s implementation scheme, the final step of dissection (i.e., nested 
dissection) is considerably more costly in terms of total storage than stopping 
short 1 step. Moreover, if one takes into account the storage overhead, it can be 
shown that stopping short by 2 steps is competitive with nested dissection. 

In summary, it follows that the penultimate level of the variant of nested 
dissection may reduce storage requirements of nested dissection with very little 
increase of computer time. 
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